Improved Keyword-Spotting Using SRI's DECIPHER(TM) Large-Vocabulary Speech-Recognition System
Abstract
The word-spotting task is analogous to text-based information retrieval tasks and message-understanding tasks in that an exhaustive accounting of the input is not required: only a useful subset of the full information need be extracted in the task. Traditional approaches have focused on the keywords involved. We have shown that accounting for more of the data, by using a large-vocabulary recognizer for the word-spotting task, can lead to dramatic improvements relative to traditional approaches. This result may well be generalizable to the analogous text-based tasks. The approach described makes several novel contributions, including: (1) a method for dramatic improvement in the FOM (figure of merit) for word-spotting results compared to more traditional approaches; (2) a demonstration of the benefit of language modeling in keyword-spotting systems; and (3) a method that provides rapid porting to new keyword vocabularies.

1. INTRODUCTION

Although both continuous speech recognition (CSR) and keyword-spotting tasks use very similar underlying technology, there are typically significant differences in the way in which the technology is developed and used for the two applications (e.g., acoustic model training, model topology and language modeling, filler models, search, and scoring). A number of HMM-based systems have previously been developed for keyword-spotting [1-5]. One of the most significant differences between these keyword-spotting systems and a CSR system is the type of non-keyword model that is used. It is generally thought that very simple non-keyword models (such as a single 10-state model [2], or the set of monophone models [1]) can perform as well as more complicated non-keyword models which include words or triphones. We describe how we have applied CSR techniques to the keyword-spotting task by using a speech recognition system to generate a transcription of the incoming spontaneous speech, which is then searched for the keywords.
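The idea of searching a recognizer's transcription for keywords can be illustrated with a minimal Python sketch. It is not the DECIPHER(TM) implementation: the `Word` record, its `start`, `end`, and `score` fields, the `spot_keywords` function, and the use of the weakest word score as the hit score are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word (hypothetical record; fields are assumptions)."""
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # per-word confidence, higher is better


def spot_keywords(transcript, keywords):
    """Return (keyword, start, end, score) for every keyword occurrence.

    Keywords may span several words ("american airlines"); a hit is
    scored with the weakest word score in the matched span.
    """
    tokens = [w.text.lower() for w in transcript]
    hits = []
    for kw in keywords:
        parts = kw.lower().split()
        n = len(parts)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == parts:
                span = transcript[i:i + n]
                hits.append((kw, span[0].start, span[-1].end,
                             min(w.score for w in span)))
    hits.sort(key=lambda h: h[1])  # order hits by start time
    return hits


# Toy transcription with made-up times and scores.
transcript = [
    Word("show", 0.0, 0.3, 0.95), Word("me", 0.3, 0.4, 0.9),
    Word("flights", 0.4, 0.9, 0.8), Word("to", 0.9, 1.0, 0.9),
    Word("boston", 1.0, 1.5, 0.7), Word("on", 1.5, 1.6, 0.9),
    Word("american", 1.6, 2.1, 0.6), Word("airlines", 2.1, 2.7, 0.85),
]
hits = spot_keywords(transcript, ["boston", "american airlines", "denver"])
for hit in hits:
    print(hit)
```

Each hit carries a time span and a score so that putative detections could later be ranked and compared against reference keyword locations when computing detection and false alarm rates.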
For this task we have used SRI's DECIPHER(TM) system, a state-of-the-art large-vocabulary speaker-independent continuous-speech recognition system [6-10]. The method is evaluated on two domains: (1) the Air Travel Information System (ATIS) domain [13], and (2) the "credit card topic" subset of the Switchboard Corpus [11], a telephone speech corpus consisting of spontaneous conversation on a number of different topics. In the ATIS domain, for 78 keywords in a vocabulary of 1200, we show that the CSR approach significantly outperforms the traditional word-spotting approach for all false alarm rates per hour per word: the figure of merit (FOM) for the CSR recognizer is 75.9 compared to only 48.8 for the spotting recognizer. In the Credit Card task, the spotting of 20 keywords and their 58 variants on a subset of the Switchboard corpus, the system's performance levels off at a 66% detection rate, limited by the system's ability to increase the false alarm rate. Additional experiments show that varying the vocabulary size from medium- to large-vocabulary recognition systems (700 to 7000) does not affect the FOM performance.

A set of experiments compares two topologies: (1) a topology for a fixed vocabulary consisting of the keywords and the N most common words in that task (N varies from zero to the vocabulary size), forcing the recognition hypothesis to choose among the allowable words (traditional CSR), and (2) a second topology in which a background word model is added to the word list, thereby allowing the recognition system to transcribe parts of the incoming speech signal as background. While including the background word model does increase the overall likelihood of the recognized transcription, the background model is highly likely to be used (due to the language model probabilities of out-of-vocabulary words) and tends to replace a number of keywords that had poor acoustic matches. Finally, we introduce an algorithm for smoothing language model probabilities.
This algorithm combines small task-specific language model training data with large task-independent language training data, and provided a 14% reduction in test set perplexity.

2. TRAINING

2.1. Acoustic Modeling

DECIPHER(TM) uses a hierarchy of phonetic context-dependent models, including word-specific, triphone, generalized-triphone, biphone, generalized-biphone, and context-independent models. Six spectral features are used to model the speech signal: the cepstral vector (C1-CN) and its first and second derivatives, and cepstral energy (C0) and its first and second derivatives. These features are computed from an FFT filterbank and subsequent high-pass RASTA filtering of the filterbank log
Publication date: 1993